{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 08: Creating Custom Environments\n",
    "\n",
    "This tutorial walks you through the process of creating custom environments in Flow. Custom environments contain specific methods that define the problem space of a task, such as the state and action spaces of the RL agent and the signal (or reward) that the RL algorithm will optimize over. By specifying a few methods within a custom environment, individuals can use Flow to design traffic control tasks of various types, such as optimal traffic light signal timing and flow regulation via mixed autonomy traffic (see the figures below). Finally, these environments are compatible with OpenAI Gym.\n",
    "\n",
    "The rest of the tutorial is organized as follows: in section 1 walks through the process of creating an environment for mixed autonomy vehicle control where the autonomous vehicles perceive all vehicles in the network, and section two implements the environment in simulation.\n",
    "\n",
    "<img src=\"img/sample_envs.png\">\n",
    "\n",
    "\n",
    "## 1. Creating an Environment Class\n",
    "\n",
    "In this tutorial we will create an environment in which the accelerations of a handful of vehicles in the network are specified by a single centralized agent, with the objective of the agent being to improve the average speed of all vehicle in the network. In order to create this environment, we begin by inheriting the base environment class located in *flow.envs*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the base environment class\n",
    "from flow.envs import Env\n",
    "\n",
    "# define the environment class, and inherit properties from the base environment class\n",
    "class myEnv(Env):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Env` provides the interface for running and modifying a SUMO simulation. Using this class, we are able to start sumo, provide a network to specify a configuration and controllers, perform simulation steps, and reset the simulation to an initial configuration.\n",
    "\n",
    "By inheriting Flow's base environment, a custom environment for varying control tasks can be created by adding the following functions to the child class: \n",
    "* **action_space**\n",
    "* **observation_space**\n",
    "* **apply_rl_actions**\n",
    "* **get_state**\n",
    "* **compute_reward**\n",
    "\n",
    "Each of these components are covered in the next few subsections.\n",
    "\n",
    "### 1.1 ADDITIONAL_ENV_PARAMS\n",
    "\n",
    "The features used to parametrize components of the state/action space as well as the reward function are specified within the `EnvParams` input, as discussed in tutorial 1. Specifically, for the sake of our environment, the `additional_params` attribute within `EnvParams` will be responsible for storing information on the maximum possible accelerations and decelerations by the autonomous vehicles in the network. Accordingly, for this problem, we define an `ADDITIONAL_ENV_PARAMS` variable of the form:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ADDITIONAL_ENV_PARAMS = {\n",
    "    \"max_accel\": 1,\n",
    "    \"max_decel\": 1,\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All environments presented in Flow provide a unique `ADDITIONAL_ENV_PARAMS` component containing the information needed to properly define some environment-specific parameters. We assume that these values are always provided by the user, and accordingly can be called from `env_params`. For example, if we would like to call the \"max_accel\" parameter, we simply type:\n",
    "\n",
    "    max_accel = env_params.additional_params[\"max_accel\"]\n",
    "\n",
    "### 1.2 action_space\n",
    "\n",
    "The `action_space` method defines the number and bounds of the actions provided by the RL agent. In order to define these bounds with an OpenAI gym setting, we use several objects located within *gym.spaces*. For instance, the `Box` object is used to define a bounded array of values in $\\mathbb{R}^n$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gym.spaces.box import Box"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, `Tuple` objects (not used by this tutorial) allow users to combine multiple `Box` elements together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gym.spaces import Tuple"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have imported the above objects, we are ready to define the bounds of our action space. Given that our actions consist of a list of n real numbers (where n is the number of autonomous vehicles) bounded from above and below by \"max_accel\" and \"max_decel\" respectively (see section 1.1), we can define our action space as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class myEnv(myEnv):\n",
    "\n",
    "    @property\n",
    "    def action_space(self):\n",
    "        num_actions = self.initial_vehicles.num_rl_vehicles\n",
    "        accel_ub = self.env_params.additional_params[\"max_accel\"]\n",
    "        accel_lb = - abs(self.env_params.additional_params[\"max_decel\"])\n",
    "\n",
    "        return Box(low=accel_lb,\n",
    "                   high=accel_ub,\n",
    "                   shape=(num_actions,))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 observation_space\n",
    "The observation space of an environment represents the number and types of observations that are provided to the reinforcement learning agent. For this example, we will be observe two values for each vehicle: its position and speed. Accordingly, we need a observation space that is twice the size of the number of vehicles in the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class myEnv(myEnv):  # update my environment class\n",
    "\n",
    "    @property\n",
    "    def observation_space(self):\n",
    "        return Box(\n",
    "            low=0,\n",
    "            high=float(\"inf\"),\n",
    "            shape=(2*self.initial_vehicles.num_vehicles,),\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.4 apply_rl_actions\n",
    "The function `apply_rl_actions` is responsible for transforming commands specified by the RL agent into actual actions performed within the simulator. The vehicle kernel within the environment class contains several helper methods that may be of used to facilitate this process. These functions include:\n",
    "* **apply_acceleration** (list of str, list of float) -> None: converts an action, or a list of actions, into accelerations to the specified vehicles (in simulation)\n",
    "* **apply_lane_change** (list of str, list of {-1, 0, 1}) -> None: converts an action, or a list of actions, into lane change directions for the specified vehicles (in simulation)\n",
    "* **choose_route** (list of str, list of list of str) -> None: converts an action, or a list of actions, into rerouting commands for the specified vehicles (in simulation)\n",
    "\n",
    "For our example we consider a situation where the RL agent can only specify accelerations for the RL vehicles; accordingly, the actuation method for the RL agent is defined as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class myEnv(myEnv):  # update my environment class\n",
    "\n",
    "    def _apply_rl_actions(self, rl_actions):\n",
    "        # the names of all autonomous (RL) vehicles in the network\n",
    "        rl_ids = self.k.vehicle.get_rl_ids()\n",
    "\n",
    "        # use the base environment method to convert actions into accelerations for the rl vehicles\n",
    "        self.k.vehicle.apply_acceleration(rl_ids, rl_actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.5 get_state\n",
    "\n",
    "The `get_state` method extracts features from within the environments and provides then as inputs to the policy provided by the RL agent. Several helper methods exist within flow to help facilitate this process. Some useful helper method can be accessed from the following objects:\n",
    "* **self.k.vehicle**: provides current state information for all vehicles within the network\n",
    "* **self.k.traffic_light**: provides state information on the traffic lights\n",
    "* **self.k.network**: information on the network, which unlike the vehicles and traffic lights is static\n",
    "* More accessor objects and methods can be found within the Flow documentation at: http://berkeleyflow.readthedocs.io/en/latest/\n",
    "\n",
    "In order to model global observability within the network, our state space consists of the speeds and positions of all vehicles (as mentioned in section 1.3). This is implemented as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "class myEnv(myEnv):  # update my environment class\n",
    "\n",
    "    def get_state(self, **kwargs):\n",
    "        # the get_ids() method is used to get the names of all vehicles in the network\n",
    "        ids = self.k.vehicle.get_ids()\n",
    "\n",
    "        # we use the get_absolute_position method to get the positions of all vehicles\n",
    "        pos = [self.k.vehicle.get_x_by_id(veh_id) for veh_id in ids]\n",
    "\n",
    "        # we use the get_speed method to get the velocities of all vehicles\n",
    "        vel = [self.k.vehicle.get_speed(veh_id) for veh_id in ids]\n",
    "\n",
    "        # the speeds and positions are concatenated to produce the state\n",
    "        return np.concatenate((pos, vel))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.6 compute_reward\n",
    "\n",
    "The `compute_reward` method returns the reward associated with any given state. These value may encompass returns from values within the state space (defined in section 1.5) or may contain information provided by the environment but not immediately available within the state, as is the case in partially observable tasks (or POMDPs).\n",
    "\n",
    "For this tutorial, we choose the reward function to be the average speed of all vehicles currently in the network. In order to extract this information from the environment, we use the `get_speed` method within the Vehicle kernel class to collect the current speed of all vehicles in the network, and return the average of these speeds as the reward. This is done as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "class myEnv(myEnv):  # update my environment class\n",
    "\n",
    "    def compute_reward(self, rl_actions, **kwargs):\n",
    "        # the get_ids() method is used to get the names of all vehicles in the network\n",
    "        ids = self.k.vehicle.get_ids()\n",
    "\n",
    "        # we next get a list of the speeds of all vehicles in the network\n",
    "        speeds = self.k.vehicle.get_speed(ids)\n",
    "\n",
    "        # finally, we return the average of all these speeds as the reward\n",
    "        return np.mean(speeds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Testing the New Environment\n",
    "\n",
    "\n",
    "### 2.1 Testing in Simulation\n",
    "Now that we have successfully created our new environment, we are ready to test this environment in simulation. We begin by running this environment in a non-RL based simulation. The return provided at the end of the simulation is indicative of the cumulative expected reward when jam-like behavior exists within the netowrk. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from flow.controllers import IDMController, ContinuousRouter\n",
    "from flow.core.experiment import Experiment\n",
    "from flow.core.params import SumoParams, EnvParams, \\\n",
    "    InitialConfig, NetParams\n",
    "from flow.core.params import VehicleParams\n",
    "from flow.networks.ring import RingNetwork, ADDITIONAL_NET_PARAMS\n",
    "\n",
    "sim_params = SumoParams(sim_step=0.1, render=True)\n",
    "\n",
    "vehicles = VehicleParams()\n",
    "vehicles.add(veh_id=\"idm\",\n",
    "             acceleration_controller=(IDMController, {}),\n",
    "             routing_controller=(ContinuousRouter, {}),\n",
    "             num_vehicles=22)\n",
    "\n",
    "env_params = EnvParams(additional_params=ADDITIONAL_ENV_PARAMS)\n",
    "\n",
    "additional_net_params = ADDITIONAL_NET_PARAMS.copy()\n",
    "net_params = NetParams(additional_params=additional_net_params)\n",
    "\n",
    "initial_config = InitialConfig(bunching=20)\n",
    "\n",
    "flow_params = dict(\n",
    "    exp_tag='ring',\n",
    "    env_name=myEnv,  # using my new environment for the simulation\n",
    "    network=RingNetwork,\n",
    "    simulator='traci',\n",
    "    sim=sim_params,\n",
    "    env=env_params,\n",
    "    net=net_params,\n",
    "    veh=vehicles,\n",
    "    initial=initial_config,\n",
    ")\n",
    "\n",
    "# number of time steps\n",
    "flow_params['env'].horizon = 1500\n",
    "exp = Experiment(flow_params)\n",
    "\n",
    "# run the sumo simulation\n",
    "_ = exp.run(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Training the New Environment\n",
    "\n",
    "Next, we wish to train this environment in the presence of the autonomous vehicle agent to reduce the formation of waves in the network, thereby pushing the performance of vehicles in the network past the above expected return.\n",
    "\n",
    "The below code block may be used to train the above environment using the Proximal Policy Optimization (PPO) algorithm provided by RLlib. In order to register the environment with OpenAI gym, the environment must first be placed in a separate \".py\" file and then imported via the script below. Then, the script immediately below should function regularly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#############################################################\n",
    "####### Replace this with the environment you created #######\n",
    "#############################################################\n",
    "from flow.envs import AccelEnv as myEnv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: We do not recommend training this environment to completion within a jupyter notebook setting; however, once training is complete, visualization of the resulting policy should show that the autonomous vehicle learns to dissipate the formation and propagation of waves in the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import ray\n",
    "from ray.rllib.agents.registry import get_agent_class\n",
    "from ray.tune import run_experiments\n",
    "from ray.tune.registry import register_env\n",
    "\n",
    "from flow.networks.ring import RingNetwork, ADDITIONAL_NET_PARAMS\n",
    "from flow.utils.registry import make_create_env\n",
    "from flow.utils.rllib import FlowParamsEncoder\n",
    "from flow.core.params import SumoParams, EnvParams, InitialConfig, NetParams\n",
    "from flow.core.params import VehicleParams, SumoCarFollowingParams\n",
    "from flow.controllers import RLController, IDMController, ContinuousRouter\n",
    "\n",
    "\n",
    "# time horizon of a single rollout\n",
    "HORIZON = 1500\n",
    "# number of rollouts per training iteration\n",
    "N_ROLLOUTS = 20\n",
    "# number of parallel workers\n",
    "N_CPUS = 2\n",
    "\n",
    "\n",
    "# We place one autonomous vehicle and 22 human-driven vehicles in the network\n",
    "vehicles = VehicleParams()\n",
    "vehicles.add(\n",
    "    veh_id=\"human\",\n",
    "    acceleration_controller=(IDMController, {\n",
    "        \"noise\": 0.2\n",
    "    }),\n",
    "    car_following_params=SumoCarFollowingParams(\n",
    "        min_gap=0\n",
    "    ),\n",
    "    routing_controller=(ContinuousRouter, {}),\n",
    "    num_vehicles=21)\n",
    "vehicles.add(\n",
    "    veh_id=\"rl\",\n",
    "    acceleration_controller=(RLController, {}),\n",
    "    routing_controller=(ContinuousRouter, {}),\n",
    "    num_vehicles=1)\n",
    "\n",
    "flow_params = dict(\n",
    "    # name of the experiment\n",
    "    exp_tag=\"stabilizing_the_ring\",\n",
    "\n",
    "    # name of the flow environment the experiment is running on\n",
    "    env_name=myEnv,  # <------ here we replace the environment with our new environment\n",
    "\n",
    "    # name of the network class the experiment is running on\n",
    "    network=RingNetwork,\n",
    "\n",
    "    # simulator that is used by the experiment\n",
    "    simulator='traci',\n",
    "\n",
    "    # sumo-related parameters (see flow.core.params.SumoParams)\n",
    "    sim=SumoParams(\n",
    "        sim_step=0.1,\n",
    "        render=True,\n",
    "    ),\n",
    "\n",
    "    # environment related parameters (see flow.core.params.EnvParams)\n",
    "    env=EnvParams(\n",
    "        horizon=HORIZON,\n",
    "        warmup_steps=750,\n",
    "        clip_actions=False,\n",
    "        additional_params={\n",
    "            \"target_velocity\": 20,\n",
    "            \"sort_vehicles\": False,\n",
    "            \"max_accel\": 1,\n",
    "            \"max_decel\": 1,\n",
    "        },\n",
    "    ),\n",
    "\n",
    "    # network-related parameters (see flow.core.params.NetParams and the\n",
    "    # network's documentation or ADDITIONAL_NET_PARAMS component)\n",
    "    net=NetParams(\n",
    "        additional_params=ADDITIONAL_NET_PARAMS.copy()\n",
    "    ),\n",
    "\n",
    "    # vehicles to be placed in the network at the start of a rollout (see\n",
    "    # flow.core.params.VehicleParams)\n",
    "    veh=vehicles,\n",
    "\n",
    "    # parameters specifying the positioning of vehicles upon initialization/\n",
    "    # reset (see flow.core.params.InitialConfig)\n",
    "    initial=InitialConfig(\n",
    "        bunching=20,\n",
    "    ),\n",
    ")\n",
    "\n",
    "\n",
    "def setup_exps():\n",
    "    \"\"\"Return the relevant components of an RLlib experiment.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    str\n",
    "        name of the training algorithm\n",
    "    str\n",
    "        name of the gym environment to be trained\n",
    "    dict\n",
    "        training configuration parameters\n",
    "    \"\"\"\n",
    "    alg_run = \"PPO\"\n",
    "\n",
    "    agent_cls = get_agent_class(alg_run)\n",
    "    config = agent_cls._default_config.copy()\n",
    "    config[\"num_workers\"] = N_CPUS\n",
    "    config[\"train_batch_size\"] = HORIZON * N_ROLLOUTS\n",
    "    config[\"gamma\"] = 0.999  # discount rate\n",
    "    config[\"model\"].update({\"fcnet_hiddens\": [3, 3]})\n",
    "    config[\"use_gae\"] = True\n",
    "    config[\"lambda\"] = 0.97\n",
    "    config[\"kl_target\"] = 0.02\n",
    "    config[\"num_sgd_iter\"] = 10\n",
    "    config['clip_actions'] = False  # FIXME(ev) temporary ray bug\n",
    "    config[\"horizon\"] = HORIZON\n",
    "\n",
    "    # save the flow params for replay\n",
    "    flow_json = json.dumps(\n",
    "        flow_params, cls=FlowParamsEncoder, sort_keys=True, indent=4)\n",
    "    config['env_config']['flow_params'] = flow_json\n",
    "    config['env_config']['run'] = alg_run\n",
    "\n",
    "    create_env, gym_name = make_create_env(params=flow_params, version=0)\n",
    "\n",
    "    # Register as rllib env\n",
    "    register_env(gym_name, create_env)\n",
    "    return alg_run, gym_name, config\n",
    "\n",
    "\n",
    "alg_run, gym_name, config = setup_exps()\n",
    "ray.init(num_cpus=N_CPUS + 1)\n",
    "trials = run_experiments({\n",
    "    flow_params[\"exp_tag\"]: {\n",
    "        \"run\": alg_run,\n",
    "        \"env\": gym_name,\n",
    "        \"config\": {\n",
    "            **config\n",
    "        },\n",
    "        \"checkpoint_freq\": 20,\n",
    "        \"checkpoint_at_end\": True,\n",
    "        \"max_failures\": 999,\n",
    "        \"stop\": {\n",
    "            \"training_iteration\": 200,\n",
    "        },\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
